Inspired from the pipeline by Pierre-Luc Germain presented in course STA426 (UZH) Source code is available on GitHub : https://github.com/sta426hs2020/project-cite-seq-analysis

Introduction

Motivations

Skin cancer is a disease that afflicts about 3.3M Americans each year (about 1% of the population) and is by far the most prevalent type of cancer. In fact, more people are diagnosed with skin cancer each year in the U.S. than all other cancers combined ((“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020)). A common type of skin cancer is called squamous cell carcinoma (SCC). This is the second most common form of skin cancer and afflicts roughly 1M Americans a year. The main causes of SCC are mutations arising from UV radiation ((“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020)). Actinik keratosis (AK) is the most common form of precancer that results from long-term exposure to UV rays. Over time, actinic keratoses may develop into squamous cell carcinoma. This condition affects more than 58 million americans ((“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020)). Understanding the heterogeneity of skin cancer cells as wall as their cell type composition over disease development are key to find new therapeutic interventions (McFarland et al. (2020))

– optional if we find something. Particularly, gaining an understanding of the biology of Cancer stem cells (CSC) is important, as they lead to metastasis and resistance to therapies. (link to )

In the following report, we analyze single cell RNA data stemming from 2 patients, one with SCC and one with AK.

Data

In the following we will be looking at CITE-seq data gathered by Levesque’s group for two patients. For both patients we have a sample of their healthy skin and and then a sample of either cancer or precancer skin. More precisely, patient one has a squamous cell carcinoma (SCC) sample and patient two an actinic keratosis (AK), a pre-maligant lesion, sample. Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) data is a single-cell sequencing method which not only sequences the transcriptome but also give information about the cell’s surface proteins. To this end surface proteins of interest are stained with antibody-derived tags (ADTs), which are then simultaneously sequenced with the transcriptome. The sequencing data is afterward analyzed using 10X genomics cell ranger pipeline. This pipeline will slightly clean up the data, align it to a genome of interest and compute a counts matrix containing the number of transcripts of a specific gene per cells. Optionally, a filtering step can be included which gets rid of empty droplets. In the end we have for each patient and each sample type, a raw and a filtered count matrix. After looking at the ADT data, we realized that a problem might have occurred in the experiment and we therefore decided to discard this part of the data for our analysis. This problem might have been caused by physical aggregation of antibodies or by undersequencing of the ADT library.

Bioconductor

One of the tools we used quite extensively is the Bionconductor package. Bioconductor contains a pool of tools, which are very usefull for the analysis of highthrouput genomic data. Given our data, we mainly used Bioconductor’s tools specialized to process, analyze and visualize single-cell RNA-seq data.

Preprocessing

Loading necessary libraries

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(scran)
  library(scater)
  library(batchelor)
  library(scDblFinder)
  library(sctransform)
  library(muscat)
  library(SEtools)
  library(cowplot)
  library(BiocParallel)
  library(ComplexHeatmap)

  library(DropletUtils) # for read10xCounts
  library(Matrix) # to handle sparse matrix
  library(BiocNeighbors) # for kNN graph
  library(igraph) # for cluster_louvain

  library(devtools)
  library(easyGgplot2) # for pretty barplots

  library(knitr) # for markers table output
  
  library(SingleR) # for reference genes
  library(celldex)
})
# set local path
set.seed(11)
local.path <- getwd()
setwd(local.path)

# use source() function to source the functions we want to execute
source("functions/functions_qc.R")
source("functions/functions_analysis.R")

Loading the data

For the following chunk, three options are available. The raw or filtered data provided by 10X can be read and used directly. Otherwise, the same data are read but also preprocess using our own quality check. The third option is used to save time in case changes were made only on the analysis part. It requires the preprocessed data to have been saved (achieved by simply runing option 2 once). After comparing the results produced from raw and filtered data, we decided to only present the ones derived from raw data in this report. Also, only a selection of the plots are presented. We leave to the interested reader the option of running the source code available in the GitHub repository.

## data can be downloaded from : http://imlspenticton.uzh.ch/dump/files_for_levesque.tar

# Pick your poison
option_selected = 3

# Decide if the code should be run using raw data or filtered data from 10X
# default is TRUE
use_raw_data = TRUE

# and let it begin
switch(option_selected,
       # Option 1: read data without doing any QC
         sces <- read_10x(use_raw_data), 
       # Option 2: read data, run QC and save clean data for later use via Option 3
       # n_cores: numbers of cores to use (parallelization used to speed up emptyDrops function), 
       # default is 4
       {
         n_cores = 4
         sces <- read_and_qc(n_cores, use_raw_data)
       },
       # Option 3: read clean data produced by previous run of Option 2
       # Warning! Option 2 should have been run at least once before trying Option 3
       sces <- read_previous_data(use_raw_data),
)
sces <- lapply(sces, split_data)

# only select the gene expression data, drop the antibody capture
sce.patient1_HS <- sces[[1]]
sce.patient1_SCC <- sces[[2]]
sce.patient2_HS <- sces[[3]]
sce.patient2_AK <- sces[[4]]

# add name to sce for output graph
attr(sce.patient1_HS, "name") <- "Patient1 HS"
attr(sce.patient1_SCC, "name") <- "Patient1 SCC"
attr(sce.patient2_HS, "name") <- "Patient2 HS"
attr(sce.patient2_AK, "name") <- "Patient2 AK"

#make sure every rowname has the format ENS_ID.Gene_Name
rownames(sce.patient1_HS) <- paste_rownames(sce.patient1_HS)
rownames(sce.patient1_SCC) <- paste_rownames(sce.patient1_SCC)
rownames(sce.patient2_HS) <- paste_rownames(sce.patient2_HS)
rownames(sce.patient2_AK) <- paste_rownames(sce.patient2_AK)

We look at both the raw and unfiltered data from the cell ranger output, to see what yields the best results. For both data sets, we first ran them through our quality control function. Since the filtered data from 10X already has all the empty droplets removed, this step was skipped for it, but not for the raw data. In the following section we will quickly go over what steps we took in our quality control function and how we made sure they were sensible.

The first step implemented in the QC is to look at the ADT data. Ideally we’d like to identify the cells which failed to capture the ADTs and remove them. This step is usually done by removing cells with ADT counts less than or equal to half of the total number of tags. In our case, this approach lead to the loss of a huge number of cells as shown in the summary output. We produced a plot showing the distribution of the number of detected ADTs and observed that most cells had negligible ADT counts. The distribution of ADT count had a smaller peak at around 50 though. We decided to change the threshold at which cells will be discarded to 25, to increase the number of retained cells. But even with this change too many cells were to be discarded. Therefore, we finally decided to ignore this part of the data completely.

In the next step, the empty droplets were removed. As mentioned before, this is only relevant for the raw data. We ran the function emptyDrops() to do so, which tests if the counts of a cell are significantly different from the ambient mRNA pool. If it is not the case, then we’re most likely facing an empty droplet. The summary statistics outputted showed that there were non-significant cell barcodes, which were bounded by the number of iterations. We solved this issue by increasing the number of iterations (to 100’000) in emptyDrops(). The distribution of p-values should be approximately uniform, which was checked using an histogram and was luckily the case with our data. It is interesting to note that this part of the quality control pipeline is responsible for the most amount of discarded cells.

Next we removed the dead cells, which are selected via their high mitochondrial gene expression. The approach we used was the same as introduced in the lecture. Lastly, we also removed cells which had low library sizes and detection rates. We used the perCellQCMetrics() and isOutlier() functions to do so. The plot showing the log-fold-change of the discarded vs. kept cells helped to investigate the possibility of an entire cell type being accidentally discarded because of the quality control. If that were the case, we would have observed an enrichment of a certain cell type in the discarded group, which can be identified by an increased expression of the corresponding marker genes.

Normalization

In order to normalize our data, we first tried Library-Size normalization, which assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells (Robert Amezquita (2020)). After observing that the separation of the clusters wasn’t satisfying, we decided to use Variance Stabilizing Transformation (VST). VST helps in the sens that it manage to transform the data into a more gaussian-shaped distribution.

Reduction

The first step of the dimensionality reduction of our data was to select the 2’000 most variable genes because cell type choice is most likely influenced by the genes varying a lot across the data set than the less variable ones. In addition to those highly variable genes (hvg), we decided to included the genes known to be present in certain cell (sub)types rather than others, even when those specific genes were not part of the 2’000 most variable ones. We ran principal component analysis (PCA) using the selected genes and kept the 20 first principal components (PCs) based on the explained variance to transform our data. Finally, we ran tSNE and UMAP on the modified data to get a 2D representation of the data.

# genes not found in our data : "WISP2" "SEPP1" "TYP1" 
# were removed from the list below
genes <- list(
  # classification: https://www.sciencedirect.com/science/article/pii/S0923181120301985?via%3Dihub#bib0050
  # KERATINOCYTE
  keratinocyte = c("DSC3","DSP","LGALS7"),
  keratinocyte_basal = c("KRT5","KRT14","COL17A1"),
  keratinocyte_suprabasal = c("KRT1","KRT10"),
  keratinocyte_differentiated = c("LOR", "SPINK5"),
  keratinocyte_ORS = c("KRT6B"),
  keratinocyte_channel = c("GJB2","GJB6","ATP1B1"),
  keratinocyte_sebaceous_gland = c("MGST1","FASN"),
  keratinocyte_sweat_gland  = c("DCD","AQP5"),
  # for dynamics, from https://www.sciencedirect.com/science/article/pii/S0092867420306723
  keratinocyte_cycling = c("MKI67"),
  # FIBROBLAST
  fibroblast  = c("DCN","LUM","MMP2"),
  # fibroblast subclassification  from: https://www.nature.com/articles/s42003-020-0922-4#
  fibroblast_secretory_reticular = c("SLPI","CTHRC1","MFAP5","TSPAN8"),
  fibroblast_proinflammatory = c("CCL19","APOE","CXCL2","CXCL3","EFEMP1"),
  fibroblast_secretory_papillary = c("APCDD1","ID1","WIF1","COL18A1","PTGDS"),
  fibroblast_mesenchymal = c("ASPN","POSTN","GPC3","TNN","SFRP1"),
  # PERICYTE & vSMC
  pericytevSMC  = c("ACTA2","TAGLN"), #vSMC : vascular smooth muscel cell
  pericytevSMC_pericyte  = c("RGS5"),
  pericytevSMC_vSMC  = c("MYL9","TPM2","RERGL"),
  # ENDOTHELIAL CELL
  endothelial  = c("PECAM1","SELE","CLDN5","VWF"),
  endothelial_lymphatic  = c("PROX1","LYVE1"),
  # MYELOID CELL
  myeloid  = c("HLA-DRA","FCER1G","TYROBP","AIF1"),
  myeloid_dendritic  = c("CD1C"),
  myeloid_langerhans  = c("CD207"),
  myeloid_macrophage = c("CD68","RNASE1","ITGAX"),
  # LYMPHOCYTE
  lymphocyte  = c("CD3D","CD3E","CD52","IL7R"),
  # MELANOCYTE
  melanocyte  = c("DCT","MLANA","PMEL"),
  # SCHWANN CELL
  schwann  = c("CDH19","NGFR"),
  # MITOTIC CELL
  mitotic  = c("UBE2C","PCNA", "CDK1")
)
  
# Known markers: all the gene markers listed above
known_markers <- c(genes$keratinocyte, genes$keratinocyte_basal, genes$keratinocyte_suprabasal, genes$keratinocyte_differentiated, genes$keratinocyte_ORS, genes$keratinocyte_channel, genes$keratinocyte_sebaceous_gland, genes$keratinocyte_sweat_gland, genes$keratinocyte_cycling, genes$fibroblast, genes$fibroblast_secretory_reticular, genes$fibroblast_proinflammatory, genes$fibroblast_secretory_papillary, genes$fibroblast_mesenchymal, genes$pericytevSMC, genes$pericytevSMC_pericyte, genes$pericytevSMC_vSMC, genes$endothelial, genes$endothelial_lymphatic, genes$myeloid, genes$myeloid_dendritic, genes$myeloid_langerhans, genes$myeloid_macrophage, genes$lymphocyte, genes$melanocyte, genes$schwann, genes$mitotic)


sce.patient1_HS <- norm_redu(sce.patient1_HS, 2000, known_markers)
sce.patient1_SCC <- norm_redu(sce.patient1_SCC, 2000, known_markers)
sce.patient2_HS <- norm_redu(sce.patient2_HS, 2000, known_markers)
sce.patient2_AK <- norm_redu(sce.patient2_AK, 2000, known_markers)

Clustering

In order to label each cell by its type in a meaningfull way, we used the low-dimensionality representation we just produced and build a KNN graph from it. Chosing \(k=30\) means that each cell was connected to its 30 closest cells. Once we had a graph, we could use the cluster_louvain function which is based on the modularity measure and a hierachical approach to find community structure of the data ((n.d.)). At this stage, the data were clustered based only on their structure, not on their specific markers (c.f. exemple for Patient1 HS below). To make the link from one cluster to a cell type, we aggregate the gene expression level of cells within a cluster to get a single value. This aggregation summarizes the relative expression of each known markers for each cluster. The next step is to bring clusters having the same expression profile under the same cell type label. There is most likely room for improvement when looking at the Patient2 AK heatmaps presented below. For instance clusters with no strong expression for any of the known markers could be left out or labeled as undefined. The assumption was that well defined cell outnumber the badly defined ones, giving statistically meaningful results in the end. Once the major cell types were identified, we took a closer look at the keratinocyte and fibroblast subtypes using the same steps but with markers linked to their respective subpopulations. We decided to only show the UMAP plot for the 2D representation of the data.

results <- sc_cluster(sce.patient1_HS, 15)

sce.patient1_HS <- results[[1]]
g.patient1_HS <- results[[2]]

results <- sc_cluster(sce.patient1_SCC, 15)
sce.patient1_SCC <- results[[1]]
g.patient1_SCC <- results[[2]]

results <- sc_cluster(sce.patient2_HS, 15)
sce.patient2_HS <- results[[1]]
g.patient2_HS <- results[[2]]

results <- sc_cluster(sce.patient2_AK, 15)
sce.patient2_AK <- results[[1]]
g.patient2_AK <- results[[2]]

Cluster annotation

The markers found in the literature (Kim, Chung, and Kim (2020), Solé-Boldo (2020)) were used to separate the different cell types and their subtypes within them. A few markers didn’t appear at all in our data (WISP2 - Secretory-reticular fibroblast, SEPP1 - Macrophage, TYP1 - Melanocyte) so the corresponding cell types were selected using their remaining markers present. Since those markers found in the literature are known to correlate with specific cell types, they were added to the list of high variable genes used to produce a lower dimensional representation of the data, even when their expression didn’t vary enough to be selected as highly variable.

Types Subtypes Markers
Keratinocyte DSC3 DSP LGALS7
basal KRT5 KRT14 COL17A1
suprabasal KRT1 KRT10
differentiated LOR SPINK5
ORS KRT6B
channel GJB2 GJB6 ATP1B1
sebaceous gland MGST1 FASN
sweat gland DCD AQP5
cycling MKI67
differentiating KRT1
Fibroblast DCN LUM MMP2
secretory reticular SLPI CTHRC1 MFAP5 TSPAN8
proinflammatory CCL19 APOE CXCL2 CXCL3 EFEMP1
secretory papillary APCDD1 ID1 WIF1 COL18A1 PTGDS
mesenchymal ASPN POSTN GPC3 TNN SFRP1
Pericyte & vSMC ACTA2 TAGLN
pericyte RGS5
vSMC MYL9 TPM2 RERGL
Endothelial cell PECAM1 SELE CLDN5 VWF
lymphatic PROX1 LYVE1
Myeloid HLA-DRA FCER1G TYROBP AIF1
dendritic CD1C
langerhans CD207
macrophage CD68 RNASE1 ITGAX
Lymphocyte CD3D CD3E CD52 IL7R
Melanocyte DCT MLANA PMEL
Schwann cell CDH19 NGFR
Mitotic cell UBE2C PCNA CDK1
genes_type <- list(keratinocyte = genes$keratinocyte, 
                   fibroblast = genes$fibroblast, 
                   pericytevSMC = genes$pericytevSMC, 
                   endothelial = genes$endothelial, 
                   myeloid = genes$myeloid, 
                   lymphocyte = genes$lymphocyte, 
                   melanocyte = genes$melanocyte, 
                   schwann = genes$schwann, 
                   mitotic = genes$mitotic)
  
results <- annotate_cells(sce.patient1_HS, g.patient1_HS, genes_type)
sce.patient1_HS <- results[[1]]
km.patient1_HS <- results[[2]]

results <- annotate_cells(sce.patient1_SCC, g.patient1_SCC, genes_type)
sce.patient1_SCC <- results[[1]]
km.patient1_SCC <- results[[2]]

results <- annotate_cells(sce.patient2_HS, g.patient2_HS, genes_type)
sce.patient2_HS <- results[[1]]
km.patient2_HS <- results[[2]]

results <- annotate_cells(sce.patient2_AK, g.patient2_AK, genes_type)
sce.patient2_AK <- results[[1]]
km.patient2_AK <- results[[2]]

Pseudo-bulk aggregation

# Patient 1 HS
results <- pseudobulk(sce.patient1_HS, km.patient1_HS)

sce.patient1_HS <- results[[1]]
pb.patient1_HS <- results[[2]]
mat.patient1_HS <- results[[3]]
cl2.patient1_HS <- results[[4]]

#Patient 1 SCC
results <- pseudobulk(sce.patient1_SCC, km.patient1_SCC)

sce.patient1_SCC <- results[[1]]
pb.patient1_SCC <- results[[2]]
mat.patient1_SCC <- results[[3]]
cl2.patient1_SCC <- results[[4]]

#Patient 2 HS
results <- pseudobulk(sce.patient2_HS, km.patient2_HS)

sce.patient2_HS <- results[[1]]
pb.patient2_HS <- results[[2]]
mat.patient2_HS <- results[[3]]
cl2.patient2_HS <- results[[4]]

#Patient 2 AK
results <- pseudobulk(sce.patient2_AK, km.patient2_AK)

sce.patient2_AK <- results[[1]]
pb.patient2_AK <- results[[2]]
mat.patient2_AK <- results[[3]]
cl2.patient2_AK <- results[[4]]

Proportion of cell populations

One of the interesting biological questions is how the cell subtype composition changes between healthy skin and diseased skin. For this, simply extracted the cell subtypes for each sample and compare the proportion of cells for each subtype. We used the pseudobulk workflow suggested in the lecture, which ‘bulks’ gene expression levels for each cluster.

A Fisher’s exact test helps us verify that for this particular patient and collection method, the proportion of certain subtypes really differs among conditions.

#Cell Subtype composition varies a lot. For Patient 1, it seems that their diseased skin (SCC) contains a far higher proportion of lymphocytes, endothelial cells and myleloid cells. Conversely, the proportion of keratinocytes is far lower. For Patient 2 with (AK), the lymphocytes is far higher, similar for fibroblasts and endothelial cells but lower for all other cell type when compared to healthy skin.

#Mitotic Cells and Keratinocytes are hard to differentiate in Patient 2 with AK. When it comes to mitotic cells, the marker genes used to quantify mitotic cells play a significant role. Using the suggested marker genes from Kim, Chung, and Kim (2020), we find mitotic cells in both healthy samples, but not in the diseased samples (AK + SCC). We experimented with different marker genes for the mitotic cells, adding some we found in Hsiao et al. (2020), but since our clustering algorithm didn’t find clusters different enough from each other, we are left with either labeling some cells as all mitotic or all keratinocytes for patient 2.

# create barplots of cell type proportions
df.celltypes.patient1 <- rbind(df_barplot_celltypes(sce.patient1_HS), df_barplot_celltypes(sce.patient1_SCC))
df.celltypes.patient1$Type <- c(rep("HS", 8), rep("SCC", 8))

df.celltypes.patient2 <- rbind(df_barplot_celltypes(sce.patient2_HS), df_barplot_celltypes(sce.patient2_AK))
df.celltypes.patient2$Type <- c(rep("HS", 8), rep("AK", 8))

dynamic_barplot(df.celltypes.patient1, "Patient 1", T, "Proportion of cell types in", c(HS ='#999999', SCC = '#E69F00'))

dynamic_barplot(df.celltypes.patient2, "Patient 2", T, "Proportion of cell types in", c(HS ='#999999', AK = '#E69F00'))

# p-value calculation
p_val_kera <- fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'keratinocyte')$p.value
p_val_lympho <- fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'lymphocyte')$p.value
p_val_fibro <- fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'fibroblast')$p.value
p_val_endothelial <- fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'fibroblast')$p.value
p_val_myeloid <- fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'myeloid')$p.value

p_val_lympho2 <- fisher_test_subtypes(sce.patient2_HS, sce.patient2_AK, 'lymphocyte')$p.value

#Cell Subtype composition varies a lot. For Patient 1, it seems that their diseased skin (SCC) contains a far higher proportion of lymphocytes (p=2.370402410^{-66}), endothelial cells (p=2.822524810^{-6}) and myeloid cells (p=8.174363210^{-11}) . Conversely, the proportion of keratinocytes is far lower (p=2.822524810^{-6}). For Patient 2 with (AK), the lymphocytes is far higher (p=0), similar for fibroblasts and endothelial cells but lower for all other cell type when compared to healthy skin.

#Mitotic Cells and Keratinocytes are hard to differentiate in Patient 2 with AK. When it comes to mitotic cells, the marker genes used to quantify mitotic cells play a significant role. Using the suggested marker genes from Kim, Chung, and Kim (2020), we find mitotic cells in both healthy samples, but not in the diseased samples (AK + SCC). We experimented with different marker genes for the mitotic cells, adding some we found in Hsiao et al. (2020), but since our clustering algorithm didn’t find clusters different enough from each other, we are left with either labeling some cells as all mitotic or all keratinocytes for patient 2.

Alternative Cell Type Assignment

When we realized we had some issues with not all subclusters being represented, we realized the above pseudobulk approach may be limited when it comes to assigning cells their most appropriate type. For one, the ‘bulking’ converts a heterogenous cell population into a single cluster and individual differences in RNA expression data is lost. Additionally, the approach relying on a few single marker genes again captures only a limited part of the gene expression profile and simply summing up expression values across different genes doesn’t take into account that some marker genes may be more important, or that a cell type with three marker genes will have a higher a sum than one with two marker genes given similar expression values for those genes.

Hence, we looked for alternative approaches. At Mark’s suggestion, we implemented SingleR (“SingleR” 2021), which assigns each cell to a cell type individually. It does this by leveraging previously published data about expression profiles of pure cells and their corresponding cell type . In this case we are using the celldex package and tried out both the BluePrintEncode Data set as well as the HumanPrimaryCellAtlasData. This original data was generated by Martens and Stunnenberg (2013) and (???) respectively and then further processed as in Aran et al. (2019).

We include this data also to show that the way one decides to assign cell type membership has an important influence on results and what conclusions are then drawn from them.

ref <- celldex::BlueprintEncodeData()
sce.patient1_HS.ref <- plot_reference_clusters(sce.patient1_HS)

sce.patient1_SCC.ref <- plot_reference_clusters(sce.patient1_SCC)

sce.patient2_HS.ref <- plot_reference_clusters(sce.patient2_HS)

sce.patient2_AK.ref <- plot_reference_clusters(sce.patient2_AK)

Optionally, we can run with the human cell atlas (not running for brevity here.)

ref <- celldex::HumanPrimaryCellAtlasData()
sce.patient1_HS.ref2 <- plot_reference_clusters(sce.patient1_HS)
sce.patient1_SCC.ref2 <- plot_reference_clusters(sce.patient1_SCC)
sce.patient2_HS.ref2 <- plot_reference_clusters(sce.patient2_HS)
sce.patient2_AK.ref2 <- plot_reference_clusters(sce.patient2_AK)

Further, we can see what a huge impact the annotation method has. While the difference in lymphocytes between healthy skin and AK was extremely pronounced using the pseudobulk method, the reference annotation shows the difference being much less pronounced. Overall, there do seem to be a bit fewer immune cells in the AK condition. We would love to investigate this more, but because of a lack of time we just show these very contradicting results for now.

lymphocytes = c("B-cells", "CD8+ T-cells","CD4+ T-cells", "NK cells")

immune_cells = c("B-cells", "CD8+ T-cells","CD4+ T-cells", "DC", "Eosinophils", "Macrophages", "NK cells", "Neutrophils")

lymphocyte_cells_patient2_HS <- ncol(sce.patient2_HS.ref[, sce.patient2_HS.ref$cluster_ref %in% lymphocytes])

lymphocyte_cells_patient2_AK <- ncol(sce.patient2_AK.ref[, sce.patient2_AK.ref$cluster_ref %in% lymphocytes])

immune_cells_patient2_HS <- ncol(sce.patient2_HS.ref[, sce.patient2_HS.ref$cluster_ref %in% immune_cells])

immune_cells_patient2_AK <- ncol(sce.patient2_AK.ref[, sce.patient2_AK.ref$cluster_ref %in% immune_cells])

total_HS <- ncol(sce.patient2_HS.ref)
total_AK <- ncol(sce.patient2_AK.ref)

proportion_HS <- c(lymphocyte_cells_patient2_HS, immune_cells_patient2_HS)/total_HS
proportion_AK <- c(lymphocyte_cells_patient2_AK, immune_cells_patient2_AK)/total_AK

dfHS <- data.frame(names=c("Lymphocytes", "Immune cells"), proportion=c(proportion_HS))
dfAK <- data.frame(names=c("Lymphocytes", "Immune cells"), proportion=c(proportion_AK))

df.celltypes.patient2 <- rbind(dfHS, dfAK)
df.celltypes.patient2$Type <- c(rep("HS", 2), rep("AK", 2))


dynamic_barplot(df.celltypes.patient2, "Patient 2", T, "Proportion of lympthocytes / immune cells in", c(HS ='#999999', AK = '#E69F00'))

Sub-population clustering

Keratinocytes subcomposition

Keratinocytes constitute 90% of epidermal skin cells. The epidermis is the outermost layer of the skin. They form a barrier against environmental damage by heat, UV radiation, water loss, various parasites such as viruses, fungi and pathogenic bacteria.

known_markers.kera <- c(genes$keratinocyte_basal, genes$keratinocyte_suprabasal, genes$keratinocyte_differentiated, genes$keratinocyte_ORS, genes$keratinocyte_channel, genes$keratinocyte_sebaceous_gland, genes$keratinocyte_sweat_gland)

# the following list is used to have prettier graph annotation
  genes.kera <- list(
    basal = genes$keratinocyte_basal,
    suprabasal = genes$keratinocyte_suprabasal,
    differentiated = genes$keratinocyte_differentiated,
    ORS = genes$keratinocyte_ORS,
    channel =  genes$keratinocyte_channel,
    sebaceous_gland = genes$keratinocyte_sebaceous_gland,
    sweat_gland  = genes$keratinocyte_sweat_gland
    )


  results <-   cluster_subtype(sce.patient1_HS, "keratinocyte", known_markers.kera, genes.kera)

  sce.patient1_HS.kera <- results[[1]]
  g.patient1_HS.kera <- results[[2]]
  
  results <-   cluster_subtype(sce.patient1_SCC, "keratinocyte", known_markers.kera, genes.kera)

  sce.patient1_SCC.kera <- results[[1]]
  g.patient1_SCC.kera <- results[[2]]
  
  results <-   cluster_subtype(sce.patient2_HS, "keratinocyte", known_markers.kera, genes.kera)

  sce.patient2_HS.kera <- results[[1]]
  g.patient2_HS.kera <- results[[2]]
  
  results <-   cluster_subtype(sce.patient2_AK, "keratinocyte", known_markers.kera, genes.kera)

  sce.patient2_AK.kera <- results[[1]]
  g.patient2_AK.kera <- results[[2]]

Keratinocytes Dynamics

# the following list is used to have prettier graph annotation
genes.dyn <- list(
  # classification: https://www.sciencedirect.com/science/article/pii/S0092867420306723
  basal = genes$keratinocyte_basal[3], # we only select "COL17A1" marker to define the basal population
  cycling = genes$keratinocyte_cycling,
  differentiating = genes$keratinocyte_suprabasal[1] # we only select "KRT1"
  )

kera.dyn.patient1_HS <- dynamic_plot(sce.patient1_HS.kera, g.patient1_HS.kera, genes.dyn)

## class: SingleCellExperiment 
## dim: 16102 931 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(16102): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000271254.AC240274.1 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(931): AAACCTGCATAGAAAC-1 AAACCTGGTCTCATCC-1 ...
##   TTTGGTTTCCGTAGGC-1 TTTGTCAAGGCTCATT-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient1_SCC <- dynamic_plot(sce.patient1_SCC.kera, g.patient1_SCC.kera, genes.dyn)

## class: SingleCellExperiment 
## dim: 15587 490 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(15587): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000277666.AC136352.3 ENSG00000273554.AC136616.1
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(490): AAACGGGGTCCGTGAC-1 AAACGGGGTCGGCACT-1 ...
##   TTTGCGCGTCGGCATC-1 TTTGGTTAGTCATGCT-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient2_HS <- dynamic_plot(sce.patient2_HS.kera, g.patient2_HS.kera, genes.dyn)

## class: SingleCellExperiment 
## dim: 15247 1577 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(15247): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000271254.AC240274.1 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(1577): AAACCTGAGACATAAC-1 AAACGGGAGGAGCGAG-1 ...
##   TTTGTCAGTACGCACC-1 TTTGTCATCAGTCAGT-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient2_AK <- dynamic_plot(sce.patient2_AK.kera, g.patient2_AK.kera, genes.dyn)

## class: SingleCellExperiment 
## dim: 15354 412 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(15354): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000273554.AC136616.1 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(412): AAAGTAGCACCGCTAG-1 AACCATGCAGGTTTCA-1 ...
##   TTTGGTTCAACAACCT-1 TTTGGTTCATGAACCT-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient1 <- rbind(kera.dyn.patient1_HS, kera.dyn.patient1_SCC)
kera.dyn.patient1$Type <- c(rep("HS", 3), rep("SCC", 3))
kera.dyn.patient2 <- rbind(kera.dyn.patient2_HS, kera.dyn.patient2_AK)
kera.dyn.patient2$Type <- c(rep("HS", 3), rep("AK", 3))


dynamic_barplot(kera.dyn.patient1, "Patient 1", F, "Proportion of Keratinocyte states in", c(HS ='#999999', SCC = '#E69F00'))

dynamic_barplot(kera.dyn.patient2, "Patient 2", F, "Proportion of Keratinocyte states in", c(HS ='#999999', AK = '#E69F00'))

Fibroblast subcomposition

known_markers.fibro <- c(genes$fibroblast_secretory_reticular, genes$fibroblast_proinflammatory, genes$fibroblast_secretory_papillary, genes$fibroblast_mesenchymal)

# the following list is used to have prettier graph annotation
genes.fibro <- list(
   # fibroblast subclassification  from: https://www.nature.com/articles/s42003-020-0922-4#
   secretory_reticular = genes$fibroblast_secretory_reticular,
   proinflammatory = genes$fibroblast_proinflammatory,
   secretory_papillary = genes$fibroblast_secretory_papillary,
   mesenchymal = genes$fibroblast_mesenchymal
)

sce.patient1_HS.fibro <- cluster_subtype(sce.patient1_HS, "fibroblast", known_markers.fibro, genes.fibro)
sce.patient1_SCC.fibro <- cluster_subtype(sce.patient1_SCC, "fibroblast", known_markers.fibro, genes.fibro)
sce.patient2_HS.fibro <- cluster_subtype(sce.patient2_HS, "fibroblast", known_markers.fibro, genes.fibro)
sce.patient2_AK.fibro <- cluster_subtype(sce.patient2_AK, "fibroblast", known_markers.fibro, genes.fibro)

Markers correspondance

The following code was used to make sure that the markers described in the literature were found in our dataset. It allowed to identify that WISP2, SEPP1 and TYP1 had to be discarded.

data_markers <- rowData(sce.patient1_HS)$Symbol
known_markers <- c("DSC3","DSP","LGALS7","KRT5","KRT14","COL17A1","KRT1","KRT10","LOR", "SPINK5","KRT6B","GJB2","GJB6","ATP1B1","MGST1","FASN","DCD","AQP5","DCN","LUM","MMP2","WISP2","SLPI","CTHRC1","MFAP5","TSPAN8","CCL19","APOE","CXCL2","CXCL3","EFEMP1","APCDD1","ID1","WIF1","COL18A1","PTGDS","ASPN","POSTN","GPC3","TNN","SFRP1","ACTA2","TAGLN","RGS5","MYL9","TPM2","RERGL","PECAM1","SELE","CLDN5","VWF","PROX1","LYVE1","HLA-DRA","FCER1G","TYROBP","AIF1","CD1C","CD207","CD68","RNASE1","SEPP1","ITGAX","CD3D","CD3E","CD52","IL7R","DCT","MLANA","PMEL","TYP1","CDH19","NGFR","UBE2C","PCNA")


idx_notfound <- which(!(known_markers %in% rowData(sce.patient1_HS)$Symbol))

print(paste("Done:", length(known_markers)-length(idx_notfound), "/", length(known_markers), "markers found"))
cat("Genes missing in dataset: ", known_markers[idx_notfound])

Summary

We started by cleaning up the raw data and applying clustering algorithm to identify the cell type populations mentioned above. The clustering of main cell type appeared to be quite satisfying whereas the cell subtypes were more challenging to separate. We observed differences in the cell (sub)type proportion in the different samples, in particular for lymphocytes and keratinocytes. The general trend showed an increase of lymphocytes and a decrease of keratinocytes in cancer tissue. We also took a look at the dynamics of the keratinocytes and noticed that

Discussion

The first point which should be discussed is related to the few relatively arbitrary cut-off values that were used. To be more specific, we decided to use the 20 first principal components (PCs) and to build our kNN graph linking a cell to its 30 closest neighbors. Both factors are influencing the results and can be discussed.

Secondly, we think the method used would benefit from a manual selection for cluster aggregation into a specific cell type. Comparing the heatmaps produced before and after aggregation showed that we sometimes faced clusters with no clear marker expression. The cells belonging to those cluster could therefore artificially blow up the proportion of a cell type after aggregation even though they are not expressing the required markers. We think that a manual pick would allow a more refine selection and the introduction of an additional label bringing all the undefined cells together.

An other point to discuss is the change in cell type proportion. We saw that changes were important for lymphocytes and keratinocytes between healthy and cancerous tissues, but it is important to keep in mind that we only have the data of two patients and therefore not a very statistically reliable result. Also as mentioned before, e.g. clusters that have no strong expression for any of the known markers we looked at, could have an effect on the number of cells counted towards a certain cell type. This could potentially give us a false impression on how different a cell type is in the healthy and cancer sample. As mentioned before this would be something to look into some more, but the assumption is that it won’t have that much of an effect.

Impact for therpeutic interventions

One of the major difficulties in the successful treatment of cancer is the lack of understanding of its evolution. How does it emerge, metastasize and relapse? Tumor heterogeneity arises from various genetic and epigenetic changes as well as external cues and selective forces. Getting a better grasp on this variability could help us understand disease state and ideally predict treatment outcomes for patients.

At what stage of cancer is a certain patient in and course is it taking? The data presented here might be an additional tool that may aid physicians in skin cancer diagnosis. Further, a certain patient whose lymphocytes are proliferating a lot faster may need different treatment than a patient whose keratinocytes are a driving force in tumor growth. At this point, our analysis shows that both precancer and cancer conditions have strong effect on cell type composition. We recommend further research to help establish more granular diagnosis of AK and SCC, based on single cell RNA-seq/CITE-seq data.

Conclusion

References

Aran, Dvir, Agnieszka P. Looney, Leqian Liu, Esther Wu, Valerie Fong, Austin Hsu, Suzanna Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2): 163–72. https://doi.org/10.1038/s41590-018-0276-y.

Hsiao, Chiaowen Joyce, PoYuan Tung, John D. Blischak, Jonathan E. Burnett, Kenneth A. Barr, Kushal K. Dey, Matthew Stephens, and Yoav Gilad. 2020. “Characterizing and Inferring Quantitative Cell Cycle Phase in Single-Cell RNA-Seq Data Analysis.” Genome Research 30 (4): 611–21. https://doi.org/10.1101/gr.247759.118.

Kim, Doyoung, Kyung Bae Chung, and Tae-Gyun Kim. 2020. “Application of Single-Cell Rna Sequencing on Human Skin: Technical Evolution and Challenges.” Journal of Dermatological Science 99 (2): 74–81. https://doi.org/https://doi.org/10.1016/j.jdermsci.2020.06.002.

Martens, Joost H. A., and Hendrik G. Stunnenberg. 2013. “BLUEPRINT: mapping human blood cell epigenomes.” Haematologica 98 (10): 1487. https://doi.org/10.3324/haematol.2013.094243.

McFarland, James M., Brenton R. Paolella, Allison Warren, Kathryn Geiger-Schuller, Tsukasa Shibue, Michael Rothberg, Olena Kuksenko, et al. 2020. “Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action.” Nat. Commun. 11 (4296): 1–15. https://doi.org/10.1038/s41467-020-17440-w.

Robert Amezquita, Stephanie Hicks, Aaron Lun. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” http://bioconductor.org/books/release/OSCA/.

“Skin Cancer Facts & Statistics - the Skin Cancer Foundation.” 2020. Skin Cancer Foundation. https://www.skincancer.org/skin-cancer-information/skin-cancer-facts.

Solé-Boldo, Raddatz, L. 2020. “Single-Cell Transcriptomes of the Human Skin Reveal Age-Related Loss of Fibroblast Priming.” Communications Biology 3 (188). https://doi.org/https://doi.org/10.1038/s42003-020-0922-4.